ABSTRACT

Web 2.0 applications have attracted a considerable amount of at- 
tention because their open-ended nature allows users to create light- 
weight semantic scaffolding to organize and share content. To date, 
the interplay of the social and semantic components of social me- 
dia has been only partially explored. Here we focus on Flickr and 
Last.fm, two social media systems in which we can relate the tag- 
ging activity of the users with an explicit representation of their 
social network. We show that a substantial level of local lexical 
and topical alignment is observable among users who lie close to 
each other in the social network. We introduce a null model that 
preserves user activity while removing local correlations, allowing 
us to disentangle the actual local alignment between users from sta- 
tistical effects due to the assortative mixing of user activity and cen- 
trality in the social network. This analysis suggests that users with 
similar topical interests are more likely to be friends, and therefore 
semantic similarity measures among users based solely on their an- 
notation metadata should be predictive of social links. We test this 
hypothesis on the Last.fm data set, confirming that the social net- 
work constructed from semantic similarity captures actual friend- 
ship more accurately than Last.fm's suggestions based on listening 
patterns.

1. INTRODUCTION

Social networking systems like Facebook and systems for con- 
tent organization and sharing such as Flickr and Delicious have 
created information-rich ecosystems where the cognitive, behav- 
ioral and social aspects of a user community are entangled with 
the underlying technological platform. This opens up new ways to 
monitor and investigate a variety of processes involving the inter- 
action of users with one another, as well as the interaction of users 
with the information they process. Social media supporting tagging
[14, 3] are especially interesting in this respect because they
stimulate users to provide light-weight semantic annotations in the 
form of freely chosen terms. Usage patterns of tags can be em- 
ployed to monitor interest, track user attention, and investigate the 
co-evolution of social and semantic networks. 

While the emergence of conventions and shared conceptualizations has
attracted considerable interest [24, 16, 25, 2], little attention
has been devoted so far to relating, at the microscopic level, the
usage of shared tags with the social links existing between users. 
The present paper aims at filling this gap. To this end we focus on 
Flickr and Last.fm, as to our knowledge they are currently the only 
popular social media systems where: (1) a significant fraction of the
users provide tag metadata for their content (photographs or songs), 
and (2) an explicit representation of the social links between users 
is readily available. 

The main question that we address in this study is the follow- 
ing: given two randomly chosen users, how does the alignment of 
their tag vocabularies relate to their proximity on the social net- 
work? That is, does lexical alignment exist between neighboring 
users, and if so, how does this alignment fade when considering 
users lying at an increasing distance on the social graph? And if 
indeed such a relationship exists, does it allow us to predict so- 
cial links from analysis of the semantic similarity among users, ex- 
tracted from their annotations? 

Contributions and outline 

The main contributions of this paper are summarized as follows: 

• In § 4.1 we show that strong correlations exist across several
measures of user activity, and characterize the mixing
patterns that involve user activity and user centrality in the 
social network. 



• In § 4.2 we develop sound measures of tag overlap. We further
introduce appropriate null models to disentangle the actual
local alignment between users from statistical effects due
to the mixing properties of user activity and centrality in the 
social network. We apply these measures to the Flickr and 
Last.fm data sets. The resulting analysis shows that, although
neither Flickr nor Last.fm supports globally shared tag vocabularies,
a substantial level of local lexical (shared tags) and
topical (shared groups) alignment is observable among users 
who are close to each other in the social network. We also 
find that some observables are more adequate than others to 
measure lexical and topical alignment, in the sense that they 
are less sensitive to purely statistical effects. 

• In § 5 we ask whether the observed correlations between annotation
metadata and social proximity allow us to use semantic
similarity between user annotations as a statistical predictor
of friendship links. We evaluate a number of semantic sim- 
ilarity measures from the literature, based on Last.fm meta- 
data. We find that when we consider the annotations of the 
most active users, almost all of the semantic similarity mea- 
sures considered outperform the neighbor suggestions from 
the Last.fm system at predicting actual friendship relations. 
Scalable semantic similarity measures such as Maximum In- 
formation Path, proposed by some of the authors, are among 
those achieving the best predictive performance. 

2. RELATED WORK 

One of the first quantitative studies on Flickr is presented by 
Marlow et al. [13], who discuss the heterogeneity of tagging patterns
and perform a preliminary analysis of vocabulary overlap between
pairs of users. The analysis suggests that users who are
linked in the Flickr social network have on average a higher vo- 
cabulary overlap, but no assessment is made of biases and other 
correlations that could be responsible for the reported observation. 

The structure and the temporal evolution of the Flickr social network
are investigated in several papers [4, 18, 17]. Leskovec et
al. [6] place a special emphasis on the local mechanisms driving
the microscopic evolution of the network.

The role of social contacts in shaping browsing patterns on Flickr 
has also been explored [5, 26], providing insights into the behavior
and activity patterns of Flickr users. 

Prieur et al. [21] investigate the role of Flickr groups as coordination
tools, and explore the relation between the density of the social
network and the density of the network of tag co-usage among the 
group members. 

Liben-Nowell and Kleinberg [8] explore several notions of node
similarity for link prediction in social networks. In our own prior
work we performed a systematic analysis of a broad range of semantic
similarity measures based on folksonomies, which can be applied
directly to build networks of users, tags, or resources [12, 10, 11].
Here we build upon this evaluation framework.

Li et al. [7] propose a system to discover common user interests
and cluster users and their saved URLs by different interest topics.
They use a Delicious data set to define implicit links between users
based on the similarity of their tags. However, they do not correlate
the interest clusters with social connections.

Perhaps the work that most directly relates to our approach is 
by Santos-Neto et al. [23], who explore the question of whether
tag-based or resource-based interest sharing in tagging systems relates
to other indicators of social behavior. The authors analyze the
CiteULike and Connotea systems, which deal with scientific publication
and lack explicit social network components. Therefore they
are unable to directly explore social friendship between two users,
and instead look at participation in the same discussion group, with 
mixed results. They do not find a statistical correlation between 
the intensity of interest sharing and the collaboration levels. Our 
present results are both more explicit (we deal with pairs of users 
rather than groups) and more conclusive. Furthermore we are able 
to evaluate our interest-based predictions against external sugges- 
tions based on independent data, quantifying the applicative value 
of our findings. 

3. DATA SETS 

Flickr makes available most of its public data by means of API 
methods (flickr.com/api). The data used for the present study
were obtained by using the public Flickr API to perform a dis- 
tributed crawl of the content uploaded to Flickr between January 
2004 and January 2006. The system was crawled during the first 
half of 2007. The crawling task was distributed by dividing the 
above interval of time into work units consisting of smaller time 
windows, and crawling each time window separately. Each crawler 
was programmed to issue search queries for every known tag, lim- 
ited to its temporal window of competence, as well as to issue 
search queries for photos uploaded in the same interval. As new 
tags were discovered, they were added to a global table shared by 
all the crawlers. Separate crawls were performed to explore the 
Flickr social network (in Flickr jargon, the "contacts" of a given 
user are her nearest neighbors in the social network, as represented 
in the system), as well as group membership information. 

Overall, the data set we analyzed comprises 241,031 users for
whom we have tagging information, and 118,144 users for whom
we also have group membership information. Our analysis will
focus on two networks. The first one, $G_0$, comprises the Flickr
users for whom we have tag, group and contact information. It
consists of 118,144 nodes (users) and 2,263,182 edges (contacts
between users). The second network, $G_1$, is obtained by extending
$G_0$ to include all the neighbors of its nodes, neighbors for whom
we may not have tagging, group membership, or complete contact
information. $G_1$ comprises 983,778 nodes and 16,673,476 edges
and will be used to check the robustness of analyses involving the
distance among users in the Flickr social network.

Similarly, we constructed our Last.fm data set using public API 
methods (last.fm/api), in particular for collecting neighbor
and friend relations. In Last.fm jargon, friends are contacts in the 
social networks, while neighbors are users recommended by the 
system as potential contacts, based on their music playing histo- 
ries. Last.fm also allows users to annotate various items (songs, 
artists, or albums) with tags. However, the Last.fm API does not
allow retrieval of the complete user annotation activity. Therefore,
with permission, we developed a crawler that extracts all the triples
(user, item, tag) and group membership information by visiting and
parsing user profile web pages. The crawls took place over a period
of a few months in the first half of 2009. The resulting data set
comprises 99,405 users, of which 52,452 are active, i.e., have at
least one annotation. The 10,936,545 triples annotate 1,393,559
items with 281,818 tags. The users belong to 66,429 groups.

No filters were applied to our data set collections. 

4. DATA ANALYSIS 

In this section we present an analysis of the data. The very same
analysis was carried out for both the Flickr and Last.fm data sets.
However, due to space limitations, we report below mainly on the
results of the Flickr analysis. Unless otherwise specified our
analysis refers to $G_0$, but we checked that the results do not
change for $G_1$. The analysis of Last.fm yielded analogous results,
both qualitatively and quantitatively. Therefore we believe our
conclusions to be robust.

Figure 1: Flickr distributions of (A) the number $k$ of neighbors
of a user, (B) the number $n_g$ of groups of which a user is a
member, (C) the number $n_t$ of distinct tags per user, and (D)
the number $a$ of tag assignments per user.

Table 1: Averages and fluctuations of Flickr user activity

Measure of activity $x$    Average $\langle x \rangle$    $\langle x^2 \rangle / \langle x \rangle$
$k$                        38.3                           469
$n_t$                      85.3                           511.4
$n_g$                      32.6                           184.6
$a$                        690.7                          8471.3

4.1 Heterogeneity and correlations 

Let us first focus on the activity of users as measured by a number
of indicators, and investigate the correlations between these
indicators. The activity of a Flickr user has indeed various aspects,
among which the most important are uploading photos, tagging them,
participating in groups, commenting on other users' photos, and other
social networking activities. Fig. 1 displays the probability
distributions of the number $k$ of neighbors in the social network (the
degree $k$ of a node), and the probability of finding a user with a
given number $n_t$ of distinct tags in her vocabulary. The breadth of
a user's tag vocabulary can be regarded as a proxy for the breadth
of her interests. We also show in Fig. 1 the distribution of the
number of groups $n_g$ to which a user belongs, and of the total number
$a$ of tag assignments submitted by a user (in this case, a tag used
twice by a user is counted twice). More precisely, if $f_u(t)$ is the
number of times that a tag $t$ has been used by user $u$, then the
total number of tag assignments of user $u$ is $a_u = \sum_t f_u(t)$.
All these distributions are broad, showing that the activity patterns
of users are highly heterogeneous. For reference, Table 1 reports the
averages and fluctuations of these quantities.
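
As a minimal sketch of how the per-user quantities $n_t$ and $a$ can be derived from raw annotations, the following Python fragment counts distinct tags and total tag assignments from a list of (user, tag) pairs. The function name `activity_measures` and the toy data are illustrative assumptions, not part of the analysis pipeline used for the study.

```python
from collections import Counter, defaultdict

def activity_measures(assignments):
    """Return {user: (n_t, a)} from an iterable of (user, tag) pairs.

    n_t is the number of distinct tags of the user; a is the total number
    of tag assignments, a_u = sum_t f_u(t).
    """
    freq = defaultdict(Counter)          # freq[u][t] plays the role of f_u(t)
    for user, tag in assignments:
        freq[user][tag] += 1
    return {u: (len(f), sum(f.values())) for u, f in freq.items()}

# Hypothetical toy data: "flower" is used twice by alice, so it counts
# once in n_t and twice in a.
toy = [("alice", "flower"), ("alice", "flower"), ("alice", "red"),
       ("bob", "tokyo")]
measures = activity_measures(toy)
```

In this toy example the repeated tag contributes to $a$ but not to $n_t$, which is the distinction drawn in the text.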

A few comments are in order. First, in our analysis we do not
consider one obvious measure of activity, namely the number of
photos uploaded by a user. One reason for this is that the number of
photos posted by a user is known to be strongly correlated with the 
number of tags from the same user [IS] . More importantly, Flickr 
is a "narrow folksonomy" in which users tend to tag mostly their
own content [28]. Thus, when exploring the similarity of users and
relating it to the underlying social network, shared usage of tags
and co-membership in groups are natural and more direct indicators
of shared interests. Another note concerns the comparison with the
study by Mislove et al. [18], who reported a smaller average degree
for the Flickr social network. This difference is due to the fact that
in our study we focus on those users who use both tags and groups.
Since only 21% of the users participate in groups [18], this means
that here we are focusing on the "active" users. As we will see
below, the various activity metrics are correlated with one another,
so users who are more active in terms of tags and groups will tend
to have more contacts in the Flickr social network, hence the higher
average degree we report here. Fig. 1, however, clearly shows that
even within this "active" set of users, very broad distributions of
activity patterns are observed and no "typical" value of the activity
metrics can be defined.

Figure 2: Average number of distinct tags ($\langle n_t \rangle$), of
groups ($\langle n_g \rangle$), and of tag assignments ($\langle a \rangle$)
of users having $k$ neighbors in the Flickr social network.

It seems natural to ask whether the different types of activity 
measures are correlated with one another and with the structure 
of the social network: are users with more social links also more 
active in tagging their content, and do they participate in more 
groups? As shown in Fig. 2, the data show that this is indeed the
case (see also Ref. [13]). Fig. 2 displays the average activity of
users with $k$ neighbors in the social network, as measured by the
various metrics defined above. For instance,

\[
\langle n_t(k) \rangle = \frac{1}{|u : k_u = k|} \sum_{u : k_u = k} n_t(u) .
\]



All types of activity have an increasing trend for increasing values
of $k$, and of course strong fluctuations are present at all values
of $k$. The strong fluctuations visible for large $k$ values are due to
the fewer highly-connected nodes over which the averages are
performed. Notably, users with a large number of social contacts but
using very few tags and belonging to very few groups can be
observed. Despite these important heterogeneities in the behavior of
users with the same degree $k$, the data clearly indicate a strong
correlation between the different types of activity metrics. The
Pearson correlation coefficients are: 0.349 between $k$ and $n_t$, 0.482
between $k$ and $n_g$, 0.268 between $k$ and $a$, 0.429 between $n_t$ and
$n_g$, 0.753 between $n_t$ and $a$, 0.304 between $n_g$ and $a$.
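
The two computations described in this paragraph, the average of an activity metric over users of equal degree and the Pearson correlation between two metrics, can be sketched as follows. This is an illustrative Python fragment under the assumption that degrees and activity values are held in plain dictionaries keyed by user; the names `average_by_degree` and `pearson` and the toy values are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def average_by_degree(degree, activity):
    """Average activity[u] over all users u that share the same degree k."""
    total, count = defaultdict(float), defaultdict(int)
    for u, k in degree.items():
        total[k] += activity[u]
        count[k] += 1
    return {k: total[k] / count[k] for k in total}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical toy data: four users with their degree and distinct-tag count.
degree = {"a": 1, "b": 1, "c": 2, "d": 3}
n_t = {"a": 2, "b": 4, "c": 5, "d": 9}
avg = average_by_degree(degree, n_t)
users = sorted(degree)
r = pearson([degree[u] for u in users], [n_t[u] for u in users])
```

On real data the averages at large $k$ rest on very few users, which is the source of the strong fluctuations mentioned above.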

Another important question concerns the correlations between
the activity metrics of users who are linked in the Flickr social
network. This is a well-known problem in the social sciences, ecology,
and epidemiology: a typical pattern, referred to as "assortative
mixing," describes the tendency of nodes of a network (here, the users)
to link to other nodes with similar properties. This appears intuitive
in the context of a social network, where one expects individuals to
connect preferentially with other individuals sharing the same
interests. Likewise, it is possible to define a "disassortative mixing"
pattern if the elements of the network tend to link to nodes that
have different properties or attributes. Mixing patterns can be
defined with respect to any property of the nodes [19]. In the present
case, we characterize the mixing patterns concerning the various
activity metrics.

In the case of large scale networks, the most commonly inves- 
tigated mixing pattern involves the degree (number of neighbors) 
of nodes. This type of mixing refers to the likelihood that users 
with a given number of neighbors connect with users with simi- 
lar degree. To this end, a commonly used measure is given by the 
average nearest neighbors degree of a user $u$,

\[
k_{nn}(u) = \frac{1}{k_u} \sum_{v \in V(u)} k_v ,
\]

where the sum runs over the set $V(u)$ of nearest neighbors of $u$. To
characterize mixing patterns in the degree of nodes, a convenient
measure can be built on top of $k_{nn}(u)$ by averaging over all nodes
$u$ that have a given degree $k$ [20, 27]:

\[
k_{nn}(k) = \frac{1}{|u : k_u = k|} \sum_{u : k_u = k} k_{nn}(u) . \tag{1}
\]



In the case of Flickr, each user is endowed with several proper- 
ties characterizing its activity. It is thus interesting to characterize 
the mixing patterns with respect to all of those properties. To this 
end, we generalize the average nearest neighbors degree presented 
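above, and define for each user $u$ the following quantities: (i) the
average number of tags of its nearest neighbors,

\[
n_{t,nn}(u) = \frac{1}{k_u} \sum_{v \in V(u)} n_t(v) ,
\]

(ii) the average total number of tags used by its nearest neighbors,

\[
a_{nn}(u) = \frac{1}{k_u} \sum_{v \in V(u)} a(v) ,
\]

and (iii) the average number of groups in which its nearest neighbors
participate,

\[
n_{g,nn}(u) = \frac{1}{k_u} \sum_{v \in V(u)} n_g(v) .
\]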



To detect the mixing patterns, in complete analogy with the case
of $k_{nn}(k)$, we compute the average number of distinct tags of the
nearest neighbors for the class of users having $n_t$ distinct tags:

\[
n_{t,nn}(n) = \frac{1}{|u : n_t(u) = n|} \sum_{u : n_t(u) = n} n_{t,nn}(u) , \tag{2}
\]

the average total number of tags used by the nearest neighbors for
the class of users with $a$ tag assignments:

\[
a_{nn}(a) = \frac{1}{|u : a(u) = a|} \sum_{u : a(u) = a} a_{nn}(u) , \tag{3}
\]

and the average number of groups of the nearest neighbors for the
class of users who are members of $n_g$ groups:

\[
n_{g,nn}(n) = \frac{1}{|u : n_g(u) = n|} \sum_{u : n_g(u) = n} n_{g,nn}(u) . \tag{4}
\]

Figure 3: (A) Average degree of the nearest neighbors of nodes
of degree $k$, computed for the $G_0$ (black circles) and $G_1$ (red
crosses) Flickr networks. (B) Average number of groups for the
nearest neighbors of nodes belonging to $n_g$ groups. (C) Average
number of tags for the nearest neighbors of nodes with $n_t$ tags.
(D) Average total number of tag assignments for the nearest
neighbors of nodes with $a$ tag assignments. In all cases a clear
assortative trend is observed.



Fig. 3 displays the quantities of Eqs. 1-4 for the Flickr data set.
In all cases, a clear assortative trend is visible: the average activity
of the neighbors of a user increases with the user's own activity,
for all the activity measures we computed. Note that for the degree
mixing patterns, the assortative trend is even enhanced when $G_1$
is considered instead of $G_0$. Large fluctuations are observed for
large activity values, because of the small number of very active
users. We remark that while Mislove et al. [18] had already found
an assortative trend with respect to the degree of the social network,
Fig. 3 highlights that the activities of socially connected users are
correlated at all levels.
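
The neighbor averages and class averages used for the mixing patterns can be sketched in a few lines of Python. The fragment below is only an illustration: it assumes the social network is stored as a dictionary mapping each user to the set of her neighbors, and the names `neighbor_average` and `class_average` are hypothetical. The same two functions apply to any node property (degree, $n_t$, $n_g$, or $a$).

```python
from collections import defaultdict

def neighbor_average(adj, x):
    """x_nn(u) = (1 / k_u) * sum of x[v] over the neighbors v of u."""
    return {u: sum(x[v] for v in nbrs) / len(nbrs)
            for u, nbrs in adj.items() if nbrs}

def class_average(x, x_nn):
    """Average x_nn(u) over the class of users sharing the same value x(u)."""
    total, count = defaultdict(float), defaultdict(int)
    for u, value in x_nn.items():
        total[x[u]] += value
        count[x[u]] += 1
    return {c: total[c] / count[c] for c in total}

# Toy undirected graph: "c" is linked to a, b and d, and a is linked to b.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
degree = {u: len(nbrs) for u, nbrs in adj.items()}
k_nn_u = neighbor_average(adj, degree)     # per-user neighbor average
k_nn_k = class_average(degree, k_nn_u)     # averaged over degree classes
```

An assortative network shows class averages that grow with the class value; the tiny toy graph here is used only to exercise the code and is not meant to reproduce that trend.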

4.2 Lexical and topical alignment 

In this section we analyze in more detail the similarity of user
profiles in relation to their social distance. More precisely, the pre- 
vious section was devoted to the correlations between the intensity 
of user activities, as quantified by several metrics. We now focus 
on the similarity between user profiles as measured by the similar- 
ity of their respective tag vocabularies, and by the similarity of the 
set of groups they belong to. 

As mentioned above, Flickr is a "narrow folksonomy" [28]: tag
annotations are provided mostly by the content creator, i.e., the tags
associated with a photo are typically provided by the user who
posted that photo. Intuitively, the absence of shared content,
together with the very personal character of both the content and the
tag metadata, makes the Flickr tag vocabulary extremely incoherent
across the user community. Conversely, social bookmarking systems
like Delicious allow multiple users to annotate the same resource
and one could argue that the browsing experience exposes
users to the global tag vocabulary and fosters, at least in principle,
imitative or cooperative processes leading to the emergence
of global conventions in the user community [14].

In light of the above observations, we do not expect to observe a
globally shared tag vocabulary in the Flickr community. A simple
test for the existence of such a globally shared vocabulary can be
performed by selecting pairs of users at random and measuring the
number of tags they share, $n_{st}$. It turns out that the average
number of shared tags is only $\langle n_{st} \rangle \simeq 1.6$. The
most common case (mode) is in fact the absence of any shared tags;
this occurs with probability close to 2/3 among randomly chosen
pairs of users.

Table 2: Tags most frequently used by three Flickr users

User A          User B           User C
green           flower           japan
red             green            tokyo
catchycolors    kitchen          architecture
flower          red              bw
blue            blue             setagaya
yellow          white            reject
catchcolors     fave             sunset
travel          detail           subway
london          closeupfilter    steel
pink            metal            geometry
orange          yellow           foundart
macro           zoo              canvas

One can nevertheless expect that a number of mechanisms may
lead to local alignment of the user profiles, in terms of shared tags
and/or group membership. The presence of a social link, in fact,
indicates a priori some degree of shared context between the
connected users, who are likely to have some interests in common,
or to share some experiences, or who are simply exposed to each
other's content and annotations. As an example, Table 2 shows the
12 most frequently used tags for three Flickr users with comparable
tagging activity. User A and user B have marked each other as
friends, while user C has no connections to either A or B on the
Flickr social network. All of these users have globally popular tags
in their tag vocabulary. In this example, the neighbors A and B
share an interest (expressed by the tag flower) and several of the
most frequently used tags (green, red, flower, blue, and yellow).

Regardless of the mechanism driving this potential local align- 
ment, in the following we want to measure this effect for the case 
of tag dictionaries and group memberships, and put it into relation 
with the distances between users along the social network. This ap- 
proach is similar to the exploration of topical locality in the Web, 
where the question is whether pages that are closer to each other in 
the link graph are more likely to be related to one another [15].

First, we must define robust measures of vocabulary similarity 
and group affiliation similarity between two users u and v. The 
first and simplest measures are given by the number of shared tags 
$n_{st}$ among the tag vocabularies of $u$ and $v$, and by the number of
shared groups $n_{sg}$ to which both $u$ and $v$ belong. These measures,
however, are not normalized and can be affected by the specific 
activity patterns of the users: two very active users may have more 
tags in common than two less active users, just because active users 
tag more, on average. We therefore consider as well a distributional 
notion of similarity between the tag vocabularies of u and v. Fol- 
lowing Ref. yj we regard the vocabulary of a user as a feature vec- 
tor whose elements correspond to tags and whose entries are the 
tag frequencies for that specific user's vocabulary. To compare the 
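tag feature vectors of two users, we use the standard cosine
similarity [22]. Denoting by $f_u(t)$ the number of times that tag $t$
has been used by user $u$, the cosine similarity $\sigma_{tags}(u, v)$
is defined as

\[
\sigma_{tags}(u, v) = \frac{\sum_t f_u(t) f_v(t)}{\sqrt{\sum_t f_u(t)^2} \, \sqrt{\sum_t f_v(t)^2}} . \tag{5}
\]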



This quantity is 0 if $u$ and $v$ have no shared tags, and 1 if they
have used exactly the same tags, in the same relative proportions.
Because of the normalization factors in the denominator,
$\sigma_{tags}(u, v)$ is not directly influenced by the global activity
of a user.

Figure 4: Top: average number of shared tags $\langle n_{st} \rangle$ for
two Flickr users as a function of their distance $d$ along the social
network. Bottom: average cosine similarity $\langle \sigma_{tags} \rangle$
between the tag vocabularies of two Flickr users as a function of $d$.
In both cases data for the same social network with reshuffled tag
vocabularies are shown.

Similarly, we can define the cosine similarity $\sigma_{groups}$ for
group memberships. Since a user belongs at most once to a group, this
reduces to

\[
\sigma_{groups}(u, v) = \frac{\sum_g \delta_u^g \delta_v^g}{\sqrt{n_g(u) \, n_g(v)}} , \tag{6}
\]

where $\delta_u^g = 1$ if $u$ belongs to group $g$ and 0 otherwise.
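
The two cosine similarities can be written down directly from their definitions. The Python sketch below assumes that a user's tag vocabulary is a dictionary from tag to usage count and that her group memberships are a set of group identifiers; the function names and toy inputs are illustrative assumptions. Returning 0 for users with no tags or no groups is a convention chosen here, not something stated in the text.

```python
from math import sqrt

def cosine_tags(f_u, f_v):
    """Cosine similarity between two tag-frequency dicts {tag: count}."""
    dot = sum(c * f_v[t] for t, c in f_u.items() if t in f_v)
    norm_u = sqrt(sum(c * c for c in f_u.values()))
    norm_v = sqrt(sum(c * c for c in f_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # convention for empty vocabularies
    return dot / (norm_u * norm_v)

def cosine_groups(g_u, g_v):
    """Cosine similarity between two sets of group identifiers."""
    if not g_u or not g_v:
        return 0.0                      # convention for users without groups
    return len(g_u & g_v) / sqrt(len(g_u) * len(g_v))

# Same tags in the same relative proportions, disjoint tags, and two users
# sharing one of their two groups.
same_mix = cosine_tags({"red": 2, "blue": 4}, {"red": 1, "blue": 2})
disjoint = cosine_tags({"red": 2}, {"tokyo": 5})
groups = cosine_groups({"g1", "g2"}, {"g2", "g3"})
```

The first example illustrates the normalization: the first user is twice as active as the second, yet the similarity is 1 up to floating-point rounding.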

To compute averages of these similarities, we randomly chose 
N = 2 X 10* users and explored their neighborhoods in a breadth- 
first fashion. In order to exclude biases due to this sampling, we 
also performed an exhaustive investigation of the social network 
neighborhoods up to distance 2 from each user, obtaining the same 
results. Moreover, we considered the distances along the social network
using $G_1$ instead of $G_0$, and again found the same results,
showing the robustness of the observed behavior with respect to 
possible sampling biases due either to the crawl or to considering 
only users having both tagging activity and groups memberships. 
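
The sampling procedure described above, a breadth-first exploration around randomly chosen users followed by an average of the similarity at each distance, can be sketched as follows. This is a simplified Python illustration: the adjacency-list representation, the `max_dist` cutoff, and the function names are assumptions, and any pairwise similarity function can be plugged in.

```python
from collections import defaultdict, deque

def bfs_distances(adj, source, max_dist):
    """Shortest-path distances from source, explored up to max_dist."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue                    # do not expand beyond the cutoff
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_by_distance(adj, seeds, similarity, max_dist=4):
    """Average similarity(u, v) over explored pairs, grouped by distance d."""
    total, count = defaultdict(float), defaultdict(int)
    for u in seeds:
        for v, d in bfs_distances(adj, u, max_dist).items():
            if d > 0:
                total[d] += similarity(u, v)
                count[d] += 1
    return {d: total[d] / count[d] for d in total}

# Toy path graph a - b - c, using the number of shared tags as similarity.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
tags = {"a": {"red", "blue"}, "b": {"red", "blue", "zoo"}, "c": {"zoo"}}
shared = lambda u, v: len(tags[u] & tags[v])
result = average_by_distance(adj, ["a", "b", "c"], shared)
```

In the toy example neighbors share 1.5 tags on average while users at distance 2 share none.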

Figures 4 and 5 give an indication of how the similarity between
users depends on their distance $d$ along the social network,
by showing the average number of shared tags, of shared groups,
and the corresponding average cosine similarities, of two users as a
function of $d$. While the average number of shared tags or groups
is quite large for neighbors (respectively close to 20 and to 10), it
drops rapidly (exponentially) as $d$ increases, and is close to 0 for
$d > 4$.

Figures 4 and 5 provide a strong indication that a certain degree
of alignment between neighbors in the social network is observed
both at the lexical level and for group affiliations. As soon as the
distance between two users on the social network is not 1 (neighbors)
or 2 (neighbors of neighbors), however, it becomes highly
probable that these users have neither tags nor groups in common.
Therefore the alignment is a strongly local effect.

The analysis of the mixing patterns of the social network performed
previously leads us to investigate in more detail this local
alignment. This analysis has indeed shown the presence of a strong
assortativity with respect to the intensity of the users' activity. It
could therefore be the case that such assortativity, by a purely
statistical effect, yields an "apparent" local alignment between the tag
vocabularies of users. For example, even in a hypothetical case of
purely random tag assignments, it would be more probable to find
tags in common between two large tag vocabularies than between
a small one and a large one.

Figure 5: Top: average number of shared groups $\langle n_{sg} \rangle$
for two Flickr users as a function of their distance $d$ along the
social network. Bottom: average cosine similarity of the group
affiliation $\langle \sigma_{groups} \rangle$ vs $d$. In both cases data
for the same social network with reshuffled group affiliations
(preserving the number of groups for each user) are shown.

Figure 6: Top: Probability distributions of the number of
shared tags of two users being at distance $d$ on the social network,
for $d = 1$ and $d = 2$ (symbols), and for the same network
with reshuffled tags (lines). Bottom: same for the distributions
of cosine similarities of the tag vocabularies.

In order to discriminate between effects due to actual lexical and 
group membership similarity and those simply due to the assorta- 
tivity, it is important to devise a proper null model, i.e. to construct 
an artificial system that retains the same social structure as the one 
under study, but lacks any lexical or topical alignment other than 
the one that may result from statistical effects. This is done by 
keeping fixed the Flickr social network and its assortativity pattern 
for the intensity of activity, but destroying socially-related lexical 
or topical alignments by means of a random permutation of tags 
among themselves and groups among themselves. More precisely, 
we proceed in the following fashion: (i) we keep the social network
unchanged; (ii) we build the global list of tags with their
multiplicity, i.e. each tag appears the total number of times it has been
used; (iii) for each user $u$ with $n_t$ tags $t_1, t_2, \ldots, t_{n_t}$,
with respective frequencies $f_1, f_2, \ldots, f_{n_t}$, we extract $n_t$
distinct tags at random from the global list of tags and assign them
to $u$ with frequencies $f_1, f_2, \ldots, f_{n_t}$. This guarantees that
the number of distinct tags
and the total number of tag assignments for each user is the same 
as in the original data, and that the distribution of frequencies of 
tags is left unchanged. Clearly, this null model preserves the assor- 
tativity patterns with respect to the amount of activity of users, as 
each user has exactly the same number of distinct tags and of tag 
assignments as in the real data. However, correlations between the 
tag vocabularies are lost, except the ones purely ascribed to statis- 
tical effects. 
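
One possible reading of steps (i) to (iii) is sketched below in Python. The text does not say whether tags are drawn from the global list with or without replacement; this sketch assumes draws with replacement and rejects repeated tags for the same user, so global tag popularity is respected only on average. The function name `reshuffle_tags` and the dictionary layout are illustrative assumptions.

```python
import random

def reshuffle_tags(user_freqs, rng):
    """Null model: give each user new tags but keep her frequency list.

    user_freqs maps user -> {tag: count}. Tags are drawn from the global
    list with multiplicity, so popular tags are drawn more often.
    """
    pool = [t for f in user_freqs.values()
            for t, c in f.items() for _ in range(c)]
    shuffled = {}
    for u, f in user_freqs.items():
        chosen = []
        while len(chosen) < len(f):      # draw until n_t distinct tags
            t = rng.choice(pool)
            if t not in chosen:
                chosen.append(t)
        # Reuse the user's original frequencies f_1, ..., f_{n_t}.
        shuffled[u] = dict(zip(chosen, f.values()))
    return shuffled

# Hypothetical toy vocabularies; the social network itself is untouched.
toy = {"alice": {"red": 3, "blue": 1}, "bob": {"tokyo": 2}}
null = reshuffle_tags(toy, random.Random(0))
```

Each user keeps her number of distinct tags and her total number of tag assignments, which is what preserves the activity assortativity while the identity of the tags is randomized.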

For group membership, we can proceed in a similar way: (i) we
build a list of groups with a multiplicity equal to the number of
users of each group (i.e., a group appears $n$ times in the list if it
has $n$ users); (ii) for each user $u$ belonging to $n_g$ groups, we
extract at random $n_g$ (distinct) groups from the list and assign them
to $u$. As for the tags, this procedure preserves the number of groups
for each user, as well as the statistics of the number of users per
group, while destroying correlations between users' group memberships.
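
The group reshuffling admits an analogous sketch. As with the tags, the Python fragment below assumes draws with replacement from the list of groups, discarding repeats for the same user, so group sizes are reproduced statistically rather than exactly. The name `reshuffle_groups` and the toy memberships are hypothetical.

```python
import random

def reshuffle_groups(memberships, rng):
    """Null model: give each user n_g distinct groups drawn at random.

    memberships maps user -> set of groups. Each group appears in the
    pool once per member, so large groups remain more likely to be drawn.
    """
    pool = [g for groups in memberships.values() for g in sorted(groups)]
    shuffled = {}
    for u, groups in memberships.items():
        chosen = set()
        while len(chosen) < len(groups):   # keep n_g unchanged for user u
            chosen.add(rng.choice(pool))
        shuffled[u] = chosen
    return shuffled

# Hypothetical toy memberships.
toy_groups = {"alice": {"macro", "tokyo"}, "bob": {"macro"}, "carol": {"bw"}}
null_groups = reshuffle_groups(toy_groups, random.Random(1))
```

The loop always terminates because the pool contains at least as many distinct groups as any single user belongs to.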

The goal of the null model is to determine the amount of lexical 
and topical alignment due to spurious activity correlations. Eliminating
such spurious correlations is analogous in purpose to the
use of inverse document frequency (IDF) in information retrieval. 
IDF discounts the contribution of globally common terms in as- 
sessing the similarity between documents. Such terms are likely to 
be shared by pairs of documents solely because of their statistical 
prevalence. Unlike in information retrieval, it is not straightfor- 
ward to apply this type of discounting in social annotations. One 
would first need to determine whether to discount tags based on 
their prevalence among users or among resources. The null model 
destroys all spurious correlations regardless of their source. 

Using the null model, we measured the alignment between users 
at distance d on the social network in the same way as for the orig-
inal data. As Figs. 4 (top) and 5 (top) show, the average number
of shared tags or of shared groups, as a function of the distance 
d, shows a similar trend to the original (non-reshuffled) data. For 
neighbors and next to nearest neighbors (d < 3), the average num- 
bers of shared tags or groups are lower in the null model than in the 
original data, but still significantly higher than for users at larger 
distances. The assortative mixing between the amount of activity 
of neighboring users is therefore enough to yield a strong lexical 
and topical alignment as simply measured by the number of shared 
tags or groups. The case of cosine similarity is quite different. 
As shown in Fig. 4 (bottom), the average cosine similarity is very
small in the null model, and does not depend strongly on the dis- 
tance in the social network. Therefore local lexical alignment is 
a real effect: friends are more likely to use similar tag patterns. 
With respect to the topical alignment, a certain — albeit weaker — 
dependence of $\langle \sigma_{groups} \rangle$ on d is visible in Fig. 5 (bottom).
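
The two alignment measures compared above, and their average as a function of social distance, can be sketched as follows. This is an illustrative sketch under stated assumptions: the social network is treated as undirected and given as an adjacency dictionary, user identifiers are comparable strings, and all function names are hypothetical.

```python
import math
from collections import defaultdict, deque


def shared_tags(a, b):
    """Number of distinct tags two users have in common."""
    return len(set(a) & set(b))


def cosine_similarity(a, b):
    """Cosine between two tag-frequency vectors (dicts tag -> count)."""
    dot = sum(f * b[t] for t, f in a.items() if t in b)
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bfs_distances(graph, source):
    """Shortest-path distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def mean_cosine_by_distance(graph, vectors):
    """Average cosine similarity over user pairs, grouped by distance d."""
    totals, counts = defaultdict(float), defaultdict(int)
    for u in vectors:
        for v, d in bfs_distances(graph, u).items():
            # u < v counts each unordered pair once (undirected graph).
            if d > 0 and v in vectors and u < v:
                totals[d] += cosine_similarity(vectors[u], vectors[v])
                counts[d] += 1
    return {d: totals[d] / counts[d] for d in totals}
```

Running the same function on the original and on the reshuffled vocabularies gives the two curves that are compared in the text.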

We also analyzed the distributions of $n_{st}$, $\sigma_{tags}$, $n_{sg}$, and $\sigma_{groups}$
among users at fixed distance d, for both the original and the reshuf-
fled data. For brevity we show only the distributions of $n_{st}$ and
$\sigma_{tags}$ for d = 1 and d = 2 in Fig. 6. The distributions of $n_{st}$ are
very similar for the original and the reshuffled data, while for the
cosine similarity they are clearly different: a much stronger local
alignment occurs in the original data.

Figure 7: Average number of shared tags (top) and average co-
sine similarity between tag vocabularies (bottom) for pairs of
Last.fm users as a function of their social distance. We also
show data for the same social network with reshuffled tag vo-
cabularies.

As mentioned earlier, analogous results are found by analyzing 
our Last.fm data set. For illustration purposes we just show in Fig. 7
the dependencies of local tag alignment measures on social dis- 
tance. Again cosine similarity is the more robust measure. 

Our investigation of the lexical and topical alignment patterns in 
Flickr and Last.fm reveals therefore the following picture. The var- 
ious measures of the topical and lexical overlap between users as a 
function of their distance along the social network clearly point to-
ward a partial local alignment, which persists up to distances 2-3,
even if large values can occasionally still be observed at larger dis- 
tances. Interestingly, if the number of shared tags between users is 
the only retained measure, a reshuffling of tags and groups between 
users shows that a large part of the alignment is simply due to the 
assortative pattern concerning users' amounts of activity. This re- 
sult highlights the importance of considering appropriate null mod- 
els to discriminate between purely statistical effects and real lexi- 
cal or topical alignments. It also shows that correctly normalized 
similarity measures such as cosine similarity, which factor out the 
effects of vocabulary size, are more appropriate for such investiga- 
tions, since they are less affected by the assortativity patterns. 

5. PREDICTING SOCIAL LINKS 

The analysis in the previous section strongly suggests that users 
with similar topical interests, as captured by shared tags in particu- 
lar, are more likely to be neighbors in the social network. Therefore 
a natural question is whether semantic similarity measures among 
users based solely on their annotation patterns can be employed as 
accurate predictors of friendship links. We tested this hypothesis 
on both our Flickr and Last.fm data sets, because each provides an- 
notation metadata needed to compute similarity as well as a social 
network to evaluate the accuracy of the predictions. For brevity 
we focus on reporting the results for the Last.fm data, which are 
more interesting for two reasons. First, contrary to Flickr, Last.fm 
is a "broad folksonomy" in which different users can easily anno-
tate the same songs, artists, or albums. This allows us to compute
similarity based on shared content as well as shared vocabulary.
Second, Last.fm provides neighbor recommendations. Neighbors 
are users with a similar music taste, based on listening patterns. 
The neighborhood relation is therefore independent of the explicit 
friendships established by the users, and provides an obvious gauge 
against which to evaluate any algorithm to predict social links. Ex- 
cept for the lack of such a comparison measure in Flickr (beyond 
the random choice baseline), and for not considering similarity 
measures based on shared items in Flickr, the prediction analysis 
yields consistent and encouraging results using both data sets. 

5.1 Overview of semantic similarity measures 

In prior work [12, 10, 11] we evaluated a number of social sim-
ilarity measures based on folksonomies, i.e., on annotations repre-
sented as triples (user, item, tag) where Flickr photos and Last.fm 
songs are instances of items. All of these social similarity mea- 
sures have the desirable property of being symmetric in the sense 
that they can be directly applied to compute the similarity between 
two items, two tags, or two users from a folksonomy. Therefore 
we employ several of these measures here to predict social network 
links from the similarity among users. We summarize below a few 
main features of the proposed user similarity measures; for further 
details and examples see Refs. [10, 11].

We consider two aggregation schemes. In distributional aggre- 
gation, we project along one of the dimensions keeping track of 
frequencies. For example, projecting onto items, a user u is rep-
resented as a tag vector whose component for tag t is the number of
items tagged by u with t. Analogously we can project onto tags rep- 
resenting users as item vectors. Unfortunately distributional aggre- 
gation requires that all similarities be recomputed for any change 
in annotations, leading to quadratic runtime complexity. 
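
The projection onto items described above can be sketched as follows. This is a minimal example under stated assumptions: annotations are given as (user, item, tag) tuples, and the function name `user_tag_vectors` is hypothetical.

```python
from collections import defaultdict


def user_tag_vectors(triples):
    """Project (user, item, tag) triples onto items.

    Returns user -> {tag: number of distinct items the user tagged with it}.
    """
    items_per_user_tag = defaultdict(set)
    for user, item, tag in triples:
        # A set ignores repeated triples for the same item.
        items_per_user_tag[(user, tag)].add(item)
    vectors = defaultdict(dict)
    for (user, tag), items in items_per_user_tag.items():
        vectors[user][tag] = len(items)
    return dict(vectors)
```

Because every vector is rebuilt from the full set of triples, any change in the annotations requires the pairwise similarities to be recomputed, which is the scalability limitation noted above.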

In collaborative aggregation, first we pick a feature (tag or item) 
and for each value of this feature we represent each user as a list of 
values of the other feature (items or tags). Then we compute a dif- 
ferent similarity value between two users according to each of these 
lists. Finally we aggregate these similarities by voting (summing). 
For example, for each tag we can compute a similarity value based 
on item lists. These are then summed across tags to obtain the fi- 
nal similarity. Analogously we can compute similarities from tag 
lists and sum them across items. Collaborative aggregation has two 
advantages. First, it can be integrated with collaborative filtering 
techniques (hence the name) by a judicious definition of conditional 
probabilities p(item|tag) or p(tag|item). This makes collabora-
tive similarity measures competitive with distributional measures
in terms of accuracy [10]. Second, similarities based on collab-
orative aggregation can be updated incrementally, in linear time.
When a triple is added or deleted, only similarities involving the 
item or tag in that triple need be updated. As a result, collabora- 
tive aggregation leads to scalable social similarity measures. Each 
aggregation scheme has two variants, depending on whether we 
project onto/aggregate across tags or items. 
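
The voting step of collaborative aggregation can be sketched as follows for the variant that sums across items. This is a simplified illustration, not the measure evaluated here: it uses a plain Jaccard coefficient as the per-item similarity and omits the conditional-probability weighting mentioned above; the function names are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard coefficient between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def collaborative_similarity(triples, u1, u2):
    """Sum, across items, of the similarity between the tag sets
    that u1 and u2 applied to each item."""
    tags_by_item = defaultdict(lambda: defaultdict(set))
    for user, item, tag in triples:
        tags_by_item[item][user].add(tag)
    total = 0.0
    for per_user in tags_by_item.values():
        # Only items annotated by both users contribute a vote.
        if u1 in per_user and u2 in per_user:
            total += jaccard(per_user[u1], per_user[u2])
    return total
```

Since each item contributes an independent term to the sum, adding or deleting a triple only changes the term for that item, which is what makes incremental updates possible.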

For each aggregation scheme/variant we consider six measures: 
cosine, overlap, matching, Dice and Jaccard coefficients, and max- 
imum information path (MIP). Note that distributional cosine with 
projection onto tag vectors is the $\sigma_{tags}$ measure discussed in the
previous section. MIP is a generalization of Lin's similarity [9] to
the non-hierarchical triple representation [11]. For example, the
distributional version of MIP with aggregation across items is de- 
fined as 

$$\sigma^{MIP}(u_1, u_2) = \frac{2 \log\left(\min_{t \in T_1 \cap T_2} p[t]\right)}{\log\left(\min_{t \in T_1} p[t]\right) + \log\left(\min_{t \in T_2} p[t]\right)}$$

where $T_i$ is the set of tags used by $u_i$ and $p[t]$ is the fraction of
users annotating with tag $t$. For aggregation across tags the defi-
nition is analogous except that we look at probabilities of shared
items. For the collaborative version projecting onto items, say, we
would similarly define $\sigma^{MIP}_{items}(u_1, u_2; r)$ for each item $r$,
replacing $T_i$ by the set $T_i^r$ of tags used by $u_i$ to annotate $r$, and
replacing $p[t]$ by a suitably defined $p[t|r]$. Finally
$\sigma^{MIP}_{items}(u_1, u_2) = \sum_r \sigma^{MIP}_{items}(u_1, u_2; r)$.
Among the measures discussed in Ref. [10]
we did not consider mutual information due to its higher computa-
tional complexity. In addition to these 6 × 2 × 2 = 24 measures, we
also consider for comparison purposes the affinity score provided 
by Last.fm for the 60 top neighbors of each user. As mentioned ear- 
lier, this score is based on similar music taste and computed from 
listening patterns. 
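
The distributional MIP measure with aggregation across items can be sketched as follows. This is a minimal sketch under stated assumptions: each user is given as a collection of tags, p[t] is estimated as the fraction of users who used tag t, and the function names as well as the value returned in the degenerate cases are conventions chosen for the example.

```python
import math


def tag_probabilities(user_tags):
    """p[t]: fraction of users who annotated with tag t."""
    n_users = len(user_tags)
    counts = {}
    for tags in user_tags.values():
        for t in set(tags):
            counts[t] = counts.get(t, 0) + 1
    return {t: c / n_users for t, c in counts.items()}


def mip_similarity(tags1, tags2, p):
    """Maximum information path similarity between two tag sets."""
    shared = set(tags1) & set(tags2)
    if not shared:
        return 0.0  # convention: no shared tag, no similarity
    numerator = 2 * math.log(min(p[t] for t in shared))
    denominator = (math.log(min(p[t] for t in tags1))
                   + math.log(min(p[t] for t in tags2)))
    # If every tag is used by all users, both terms are zero;
    # returning 0.0 in that case is an assumption of this sketch.
    return numerator / denominator if denominator != 0 else 0.0
```

The rarest shared tag determines the numerator, so two users who share only a very common tag receive a low score even if their vocabularies are large.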

5.2 Methodology 

The evaluation consists in selecting a set of pairs of users, com- 
puting each similarity measure for each pair, and adding social 
links between users in decreasing order of their topical similarity: 
the pairs of users with highest similarity are those we predict to 
be most likely friends. For each predicted social link, we check 
the actual social network to see if the prediction is correct. As one 
decreases the similarity threshold more links are added, leading to 
more true positives but also more false positives. The best similar- 
ity measure is the one that achieves the best ratio of true positive 
to false positive rate across similarity values, as illustrated by ROC 
plots and quantified by the area under the ROC curve (AUC). 
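
The area under the ROC curve can be computed directly from the similarity scores and the observed friendships, without tracing the curve. The sketch below uses the standard rank-based formulation, in which the AUC equals the probability that a randomly chosen friend pair receives a higher score than a randomly chosen non-friend pair (ties count one half). The function name and input layout are assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve.

    scores: similarity value for each sampled user pair
    labels: truthy if the pair is an actual friendship link
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one friend and one non-friend pair")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to the random baseline, and 1.0 to a measure that ranks every friend pair above every non-friend pair.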

To sample the pairs of users from our data set, we start by sorting 
the users by one of three different criteria: 

1. Most Active: By number of annotations; 

2. Most Connected: By number of friends; 

3. Random: By shuffling. 

The set P of pairs is then constructed according to the following 
algorithm: 

repeat:
    pick next u by sorting criterion
    R ← set of 60 neighbors of u
    for each n from R:
        if n is active:
            P ← P ∪ {(u, n)}
    stop when |P| = M
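
A runnable version of this sampling procedure might look as follows. It is a sketch under stated assumptions: `sorted_users` is already ordered by the chosen criterion, `neighbors` maps each user to the ranked list of Last.fm neighbors, `active` is the set of users with at least one annotation, and pairs are treated as ordered. Unlike the pseudocode, the stopping condition is checked after every added pair so that exactly M pairs are returned when enough are available.

```python
def sample_pairs(sorted_users, neighbors, active, M, top_k=60):
    """Collect up to M (user, neighbor) pairs with an active neighbor."""
    pairs = []
    seen = set()
    for u in sorted_users:
        # Consider only the top_k neighbors recommended for u.
        for n in neighbors.get(u, [])[:top_k]:
            if n in active and (u, n) not in seen:
                seen.add((u, n))
                pairs.append((u, n))
                if len(pairs) == M:
                    return pairs
    return pairs
```

For example, `sample_pairs(users_by_activity, neighbors, active, 1000)` would correspond to the "Most Active" condition, where `users_by_activity` is a hypothetical list sorted by number of annotations.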

Recall that users are considered active if they have at least one an- 
notation. This is a requirement in order to compute topical simi- 
larity. The choice to select pairs among neighbors stems from the 
goal of comparing the accuracy of topical similarity methods with 
Last.fm recommendations. Given the sparsity of the social and 
neighbor networks, comparative evaluation would be impossible 
without such a sampling. Note that this sampling strategy may bias 
the evaluation in favor of Last.fm's neighbor recommendations, be- 
cause if two active neighbors are friends, they are guaranteed to be 
detected while two active friends who are not neighbors would be 
missed by our sampling even if they were detected by our similar- 
ity measures. Therefore our sampling algorithm is a conservative 
choice in that it does not unfairly help our similarity measures in 
the evaluation.¹ We experimented with sets of pairs of cardinality
M = 1,000 and M = 2,500. The results are similar; we report
below on evaluations with 1,000 pairs. 



¹ At press time Last.fm has released a new API functionality, called
Tasteometer, to query the affinity score for arbitrary user pairs. This 
will allow us to sample users independently of neighborhood rela- 
tions in future evaluations. 
Figure 8: ROC curves comparing the social link predictions of
distributional and collaborative MIP with the Last.fm recom- 
mendations. Triples can be aggregated across items (left) or 
tags (right). Users are sampled from the most active (top), the 
most connected (middle), or at random (bottom). 



5.3 Results 

The best results are obtained by sampling the most active users. 
This is not surprising, as the topical similarity measures have more 
evidence at their disposal from the users' metadata. In Fig. 8 we
show ROC plots for the MIP measures, which perform consis- 
tently well (among the top 3 measures) in all conditions. While 
Last.fm neighbor recommendations do perform better than the ran- 
dom baseline, topical similarity is much more accurate than mu- 
sic taste in predicting friends for the most active users. The high- 
est accuracy is achieved by aggregating across items, i.e. repre- 
senting users as vectors of tags. For the most connected users as 
well as randomly selected users, the topical similarity measures 
still perform significantly better than the random baseline, but only 
marginally better than Last.fm neighbor recommendations. Let us 
therefore focus on the most active users to evaluate the predictions 
of additional measures. 

Since it is difficult to compare 25 ROC plots, let us summarize 
our results as follows. For each of the 24 topical similarity mea-
sures, $\sigma$, we compare the area under the ROC curve with that ob-
tained by the Last.fm neighbor recommendations. We measure the
relative improvement AUC($\sigma$)/AUC(Last.fm) − 1. A positive
number indicates higher accuracy than Last.fm in the sense of a 
larger number of true positives for the same number of false posi-
tives. Fig. 9 shows that all topical similarity measures outperform
the Last.fm neighbor recommendations. The lone exception is
distributional item overlap, for which the improvement is not sig- 
nificant. For distributional measures, aggregation across items (fo- 
cusing on shared tags) yields better predictions. Overall, the best 
accuracy is achieved by distributional MIP based on shared tags 
(37% improvement). However, if scalability is important, predic-
tions of comparable accuracy can be obtained by projecting over
each tag, and then aggregating the similarities across tags. Col-
laborative matching yields the best predictions in this case (35%
improvement), followed closely by collaborative MIP (33%) and
overlap (32%).

Figure 9: Relative improvement in area under ROC curves
over Last.fm neighbor recommendations, for the most active
users. Triples are aggregated across items (top) or tags (bot-
tom).

In summary, these results confirm that the social network con- 
structed from semantic similarity based on user annotations cap- 
tures actual friendship more accurately than Last.fm's recommen- 
dations based on listening patterns. This suggests that the Last.fm 
neighborhood selection could be improved by adopting tag-based
similarity measures, especially for active users. The results are 
qualitatively similar for the other sampling methods, but the dif- 
ferences in accuracy are less significant, with the best predictions 
outperforming Last.fm by at most 3-4% in AUC for the most con- 
nected users and by 1-5% for random users. 

6. CONCLUSION AND FUTURE WORK 

In this paper we exploited one peculiarity of Flickr and Last.fm, 
namely the availability of both tagging data and the explicit social 
links between users, to investigate the interplay of the social and 
semantic aspects of Web 2.0 applications. 

We showed that strong correlations exist between user activity in 
the social context (user degree centrality and group participation) 
and the tagging activity of the same user, and that a strong assor- 
tative mixing exists in the social network; more active nodes tend 
to have as neighbors other active nodes. We also found that a lo- 
cal alignment of users' tag vocabularies is clearly visible between 
nearby users in the social network, even for social tagging systems 
that lack a notion of globally shared tag vocabulary, such as Flickr.
We investigated the dependence of the number of shared tags and 
the number of shared groups of two users, as a function of their 
shortest-path distance on the social network. We introduced a null 
model and we used it to show that part of the similarity between 
users who are close on the social network is due to the aforemen- 
tioned correlations between user activity and user degree centrality 
in the social network. That is, assortativity and heterogeneity alone 
can yield a comparatively higher overlap of tag usage and group 
membership for neighboring users. In this context, our work high-
lights the importance of backing up the data analysis with carefully
designed null models, which are necessary — as is the case here 
— to disentangle the actual signal we are looking for from effects 
arising purely from correlations and mixing properties. 

Armed with the null model methodology, we showed that it is 
possible to define measures of tag vocabulary and group member- 
ship overlap that are robust with respect to the above biases. We 
investigated the average similarity of two users, according to such 
measures, as a function of the distance in the social network, find- 
ing that a clear signal of local lexical and topical alignment can be 
detected in Flickr and Last.fm. 

The observed local alignment between lexical (tag) features on 
the social network led us to investigate the question of whether top- 
ical similarity measures based on social annotations can be applied 
to the prediction (or recommendation) of friend relations in a social 
network. Last.fm provided us with an ideal opportunity to explore 
this question thanks to the simultaneous availability of social link 
recommendations based on music listening patterns, along with the 
annotation metadata and social network. 

We were able to evaluate the predictive power of a large number 
of social topical similarity measures from the literature, spanning 
multiple aggregation/projection schemes. The results were very en- 
couraging; using any of the tested social similarity measures we 
were able to improve on the accuracy of the social link predictions 
provided by Last.fm, and the improvements were especially sig- 
nificant for users who are active taggers. Equally encouraging is 
the fact that accurate predictions are afforded even by incremental 
measures, pointing to scalable algorithms to compute social link 
recommendations or improve existing methods. 

Among the various measures we evaluated, maximum informa- 
tion path has proven very accurate across aggregation schemes, 
data sets, and sampling methodologies. When predicting social 
links between active taggers, MIP is the best measure among those 
based on distributional aggregation (regardless of whether we ag- 
gregate across items or tags), and either the best or a close second 
among the scalable measures based on collaborative aggregation, 
across items or tags respectively. 

As expected, the Last.fm neighborhood relation seems to be in- 
dependent of the tagging activity of users; we obtain very close 
AUC values for both the most active and most connected sampling 
strategies. Therefore the number of annotations considered does 
not affect the estimation of user affinity based on listening patterns. 
Accordingly, the present results suggest that the Last.fm neighbor- 
hood recommendation could benefit considerably from social sim- 
ilarity measures — especially for active users. 

Our results have important implications for the design of social 
media. As social networks and social tagging continue to become 
increasingly popular and integrated in the Web 2.0, our techniques 
can be directly applied to improve the synergies between social and 
semantic networks — specifically, to help users find friends with 
similar topical interests as well as facilitate the formation of topical 
communities. 

We plan to further validate our findings via user studies. We will 
pursue this direction by integrating a "suggest friend" functionality 
into GiveALink.org, a social bookmarking system developed 
by our group at Indiana University for research purposes. 

On the more theoretical side, future work will consider the present 
analysis performed longitudinally in time, to move from assessing 
correlations to assessing causality. We will investigate whether the 
activation of a social link induces a local alignment of tags and 
group membership, or conversely a similarity in interests triggers 
the creation of a social link. Both processes probably play an im-
portant role in different situations, and adding a temporal dimen-
sion to the analysis presented here will provide new insight for
modeling the structure and evolution of user-driven systems. 